-
Notifications
You must be signed in to change notification settings - Fork 5.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Data] Normalize block types before internal multi-block operations #43764
Conversation
Signed-off-by: Scott Lee <[email protected]>
Signed-off-by: Scott Lee <[email protected]>
# If block types are different, but still both of TableBlock type, try | ||
# converting both to default block type before zipping. | ||
self_default, other_default = self.to_default(), acc.to_default() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we call normalize_block_types
instead to keep code to be consistent?
self_default, other_default = self.to_default(), acc.to_default() | ||
return BlockAccessor.for_block(self_default).zip(other_default) | ||
else: | ||
raise ValueError( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In whic case, this ValueError
will be triggered?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for Block
s which do not extend TableBlock
class, i think this will be the case. since both ArrowBlock
and PandasBlock
are TableBlock
s themselves, this isn't an issue for these classes, but this would cover any case in which we have other types of Block
s in the future. i can also remove this if we think it's not useful
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it, it's fine to keep it.
seen_types = set() | ||
for block in blocks: | ||
acc = BlockAccessor.for_block(block) | ||
assert isinstance(acc, TableBlockAccessor), type(acc) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
better to throw an actionable error message instead of assert
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@c21 do we still use non-table blocks anywhere?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we still use non-table blocks anywhere?
Actually I don't think we use non-table blocks anywhere.
else: | ||
results = [BlockAccessor.for_block(block).to_default() for block in blocks] | ||
|
||
assert all(isinstance(block, type(results[0])) for block in results) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same here.
seen_types = set() | ||
for block in blocks: | ||
acc = BlockAccessor.for_block(block) | ||
assert isinstance(acc, TableBlockAccessor), type(acc) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@c21 do we still use non-table blocks anywhere?
def test_zip_multiple_block_types(ray_start_regular_shared): | ||
df = pd.DataFrame({"spam": [0]}) | ||
ds_pd = ray.data.from_pandas(df) | ||
ds2_arrow = ray.data.from_items([{"ham": [0]}]) | ||
assert ds_pd.zip(ds2_arrow).take_all() == [{"spam": 0, "ham": [0]}] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we move this to a different test module? I don't think zip
is an all-to-all operation?
Signed-off-by: Scott Lee <[email protected]>
Signed-off-by: Scott Lee <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LG
Why are these changes needed?
Applying grouping operations on
Dataset
s with different underlyingBlock
types can cause exceptions (e.g. variousAttributeError
s) due toBlockAccessor
s assuming that all input block types are of the same type.We handle this case by normalizing the blocks (either
ArrowBlock
orPandasBlock
) to the first block type before applying the rest of the grouping/aggregation logic.Related issue number
Closes #31550
Closes #39206
Closes #39155
Closes #39291
Inspired by #39960
Checks
git commit -s
) in this PR.scripts/format.sh
to lint the changes in this PR.method in Tune, I've added it in
doc/source/tune/api/
under thecorresponding
.rst
file.